{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Class-weighted Learning\n",
    "- Class-imbalance problem is actually a quite common problem. For instance, there are much more purchasers among mobile app users and much more non-criminals than criminals in society.\n",
    "- However, if class imbalance is too severe (i.e., training set is highly skewed), it is likely to  bear undesirable effects. \n",
    "    - For instance, algorithm will tend to vote for majority class, all the time.\n",
    "    - This is highly risky since we might lose track of purchasers among mobile app users and criminals, which are relatively rare among training instances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.utils import class_weight\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "from collections import Counter\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense\n",
    "from keras.optimizers import adam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Dataset\n",
    "- Breast cancer dataset in ```sklearn```\n",
    "- doc: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = load_breast_cancer()\n",
    "X_data = data.data.tolist()\n",
    "y_data = data.target.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of malignant instances (0):  212\n",
      "Number of benign instances (1):  357\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of malignant instances (0): \", Counter(y_data)[0])\n",
    "print(\"Number of benign instances (1): \", Counter(y_data)[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete some of malignant instances to generate class-imbalance situation artificially\n",
    "for i in range(200):\n",
    "    if y_data[i] == 0:\n",
    "        X_data[i] = None\n",
    "        y_data[i] = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_data = [x for x in X_data if x != None]\n",
    "y_data = [y for y in y_data if y != None]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of malignant instances (0):  108\n",
      "Number of benign instances (1):  357\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of malignant instances (0): \", Counter(y_data)[0])\n",
    "print(\"Number of benign instances (1): \", Counter(y_data)[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(372, 30) (93, 30) (372,) (93,)\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(np.asarray(X_data), np.asarray(y_data), test_size = 0.2, random_state = 7) \n",
    "\n",
    "print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing class weights\n",
    "- We compute class weights based on training dataset, and deliver as parameter when fitting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights = class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Computed class weights:  {0: 2.2409638554216866, 1: 0.643598615916955}\n"
     ]
    }
   ],
   "source": [
    "class_weights = dict(zip(np.unique(y_train), weights))\n",
    "print(\"Computed class weights: \", class_weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Naive Learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def simple_mlp():\n",
    "    model = Sequential()\n",
    "    model.add(Dense(10, input_shape = (X_train.shape[1],), activation = 'relu'))\n",
    "    model.add(Dense(1, activation = 'sigmoid'))\n",
    "    model.compile(optimizer = adam(lr = 0.001), loss = 'binary_crossentropy', metrics = ['acc'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = simple_mlp()\n",
    "model.fit(X_train, y_train, epochs = 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = model.predict(X_test).round()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% of predicted 1's:  1.0\n",
      "Overall Accuracy Score:  0.731182795699\n"
     ]
    }
   ],
   "source": [
    "print(\"% of predicted 1's: \", y_pred.sum()/len(y_pred))\n",
    "print(\"Overall Accuracy Score: \", accuracy_score(y_pred, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Class-weighted learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = simple_mlp()\n",
    "model.fit(X_train, y_train, epochs = 100, class_weight = class_weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred = model.predict(X_test).round()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% of predicted 1's:  0.763440860215\n",
      "Overall Accuracy Score:  0.967741935484\n"
     ]
    }
   ],
   "source": [
    "print(\"% of predicted 1's: \", y_pred.sum()/len(y_pred))\n",
    "print(\"Overall Accuracy Score: \", accuracy_score(y_pred, y_test))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
